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Abstract 


This  paper  describes  an  efficient  algorithm  for  the  parallel  solution  of  systems  of 
linear  equations  with  a  block  tridiagonal  coefficient  matrix.  The  algorithm  comprises 
a  multilevel  LU-factorization  based  on  block  cyclic  reduction  and  a  corresponding 
solution  algorithm. 

The  paper  includes  a  general  presentation  of  the  parallel  multilevel  LU-factoriza- 
tion  and  solution  algorithms,  but  the  main  emphaisis  is  on  implementation  principles 
for  a  message  passing  computer  with  hypercube  topology.  Problem  partitioning, 
processor  allocation  and  communication  requirements  are  discussed  for  the  general 
block  tridiagonal  algorithm. 

Band  matrices  can  be  cast  into  block  tridiagonal  form,  and  this  special  but 
important  problem  is  dealt  with  in  detail.  It  is  demonstrated  how  the  efficiency  of 
the  general  block  tridiagonal  multilevel  algorithm  can  be  improved  by  introducing 
the  equivalent  of  two-way  Gaussian  elimination  for  the  first  and  the  last  partitioning 
and  by  carefully  balancing  the  load  of  the  processors.  The  presentation  of  the 
multilevel  band  solver  is  accompanied  by  detailed  complexity  analyses. 

The  properties  of  the  parallel  band  solver  were  evaluated  by  implementing  the 
algorithm  on  an  Intel  iPSC  hypercube  parallel  computer  and  solving  a  larger  number 
of  banded  linear  equations  using  2  to  32  processors.  The  results  of  the  evaluation 
include  speed-up  over  a  sequential  processor,  and  the  measured  values  are  in  good 
agreement  with  the  theoretical  values  resulting  from  complexity  analysis.  It  is  found 
that  the  maximum  asymptotic  speed-up  of  the  multilevel  LU-factorization  using  p 
processors  and  load  balancing  is  approximated  well  by  the  expression  (p4-6)/4. 

Finally,  the  multilevel  parallel  solver  is  compared  with  solvers  based  on  row 
interleaved  organization  and  with  other  block  solvers. 
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1  Introduction 

Many  technical  and  scientific  problems  involve  the  solution  of  linear  systems  of  equations 

Ax  =  b  (1.1) 

where  A  can  be  structured  as  a  block  tridiagonal  matrix.  The  discretization  of 
boundary  value  problems  for  both  ordinary  and  partial  differential  equations  may  lead 
to  band  matrices  which  can  be  structured  into  block  tridiagonal  matrices.  Likewise, 
the  axialysis  of  linearly  connected  substructures  (mechanical,  electrical,...)  often  leads  to 
block  tridiagonal  matrices.  These  matrices  may  be  numerically  symmetric,  symmetric  in 
structure  only,  or  nonsymmetric  but  still  block  tridiagonal  in  structure. 

This  paper  presents  an  algorithm  for  the  efficient  solution  of  (1.1)  on  a  parallel  com¬ 
puter  with  medium  to  large  number  of  processors.  A  version  of  the  algorithm  for  solving 
structurally  symmetric  band  systems  has  been  implemented  for  the  Intel  iPSC  hypercube, 
and  the  performance  of  the  implementation  of  the  algorithm  is  evaluated  in  great  detail. 

The  multilevel  parallel  solver  is  based  on  block  cyclic  reduction  [1]  which  permits  sep¬ 
arate  LU-factorization  and  solution  stages.  The  formulation  of  the  block  cyclic  reduction 
which  we  have  used  is  related  to  nested  dissection  [2,3],  and  it  is  an  LU-factorization  and 
solution  of  a  block  reordered  system.  The  communication  required  by  the  band  matrix 
implementation  of  the  parallel  solver  is  shown  to  be  negligible. 

Our  implementation  of  the  parallel  solver  introduces  some  enhancements  to  standard 
block  cyclic  reduction  which  may  also  be  used  for  a  general  block  tridiagonal  matrix.  The 
first  enhancement  is  two-way  Gaussian  elimination  for  the  first  and  the  leist  block.  This 
eliminates  fill-ins  in  the  two  blocks  and  permits  the  second  enhancement,  an  efficient  load 
balancing  which  improves  the  efficiency  with  the  equivalent  of  6  extra  processors.  The 
third  enhancement  is  a  block  parallel  organization  which  improves  processor  utilization  at 
the  lower  levels  of  the  algorithm,  with  a  modest  communication  penalty. 

The  performance  of  the  implementation  of  the  multilevel  parallel  band  solver  on  the 
Intel  iPSC  with  32  processors  is  evaluated  carefully,  and  the  measured  execution  times 
are  compared  with  predictions  derived  from  complexity  analysis. 

Direct  parallel  solvers  can  be  classified  into  two  different  groups:  block  methods  (as  the 
one  described  here  and  [4,  5,  6,  7]  )  and  row  interleaved  methods  [8].  The  block  methods 
pay  a  heavy  penalty  in  terms  of  fill-ins  while  communication  cost  is  negligible.  The  block 
methods  can  exploit  many  processors,  limited  only  by  the  dimension/bandwidth  ratio. 

The  row  interleaved  algorithm  is  computationally  identical  to  a  sequential  Gaussian 
elimination.  The  pivot  row  is  broadcasted  and  the  processors  perform  the  elimination 
in  parallel.  Ideal  speed-up  is  prevented  by  communication  which  is  fairly  expensive  on 
medium  grain  parallel  processors  such  as  the  Intel  iPSC  [9].  The  row  interleaved  algorithm 
can  only  exploit  a  number  of  processors  corresponding  to  the  half  bandwidth. 

The  block  methods  are  therefore  advantageous  for  narrow  band  problems  and  medium 
to  large  number  of  processing  elements.  For  wide  band  problems  and  few  processors,  the 
row  interleaved  algorithm  is  the  better.  It  is  also  worth  mentioning  at  this  point  that 
almost  all  reported  implementations  of  parallel  solvers  of  banded  systems  are  done  on 


shared-memory  vector  machines,  such  as  the  Alliant  and  the  Cray  computers,  while 
our  implementation  is  done  on  a  distributed-memory  system  with  a  Umited  number  of 
processors  and  no  vectorization,  namely,  the  Intel  iPSC. 

This  paper  is  organized  as  follows.  In  section  2  block  tridiagonal  matrices  are  briefly 
described.  Sections  3  and  4  explain  the  multilevel  LU-factorization  and  solution  steps. 
Section  5  describes  the  general  principles  for  implementing  the  multilevel  algorithm  on 
a  hypercube,  including  the  communication  scheme  among  the  processors.  The  details  of 
the  implementation  on  the  hypercube  are  given  in  section  6.  Section  6.1  describes  the 
partitioning  and  allocation  of  tasks  to  different  processors  and  the  ordering  of  che  first 
and  the  last  partitions  to  reduce  fill-ins  by  using  2-way  Gaussian  elimination.  The  per¬ 
formance  of  the  multilevel  algorithm  is  estimated  in  section  6.2  using  complexity  analysis. 
The  important  problem  of  load  balancing  is  addressed  in  section  6.3.  As  a  result  of  the 
complexity  analysis  it  becomes  clear  that  uniform  partitioning  of  the  band  matrix  will 
lead  to  poor  load  balance  for  the  LU-factorization.  Balance  equations  for  selecting  the 
sizes  of  the  partitions  are  then  derived.  The  performance  of  the  parallel  band  matrix 
solver  is  presented  in  section  7.  This  includes  the  executing  time  graph  model  and  actual 
numerical  results.  A  comparison  between  the  multilevel  approach  and  the  row  interleaved 
factorization  and  solution  approach  is  given  in  section  7.3. 


2  Block  tridiagonal  matrices 

Consider  the  system  of  linear  equations  (1.1)  where  A,  x  and  b  are  partitioned  as  follows: 


■ 

■ 

■  6:  ■ 

C2  A2  B2 

X2 

62 

Cz  A3  Bz 

,x  = 

Xz 

,fe  = 

63 

Cfi  An  . 

.  Xn  . 

.  . 

N  is  odd  by  assumption  and 

ArCiZ"''’'"’’  and  XriKeR^''  for  r=  1,2,..,  A’. 


(2.1) 


for  r  =  1,2,..,  A  -  1  and 


^  ^  2, 3,..,  A. 

The  entries  of  A  are  zero  outside  the  tridiagonal  band  of  matrices. 
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Figure  2.1.;  Examples  of  typical  block  tridiagonal  matrices. 


Figure  2.1  shows  three  different  examples  of  typical  block  tridiagonal  matrices.  The 
first  matrix  occurs  in  the  lower  levels  of  the  multilevel  algorithm  applied  to  a  structurally 
symmetric  bcind  matrix.  The  second  matrix  is  a  nonsymmetric  band  matrix  with  a  block 
tridiagonal  structure  superimposed  on  it.  The  third  example  illustrates  two  connected 
arbitrary  nonsymmetric  sub-structures  grouped  into  a  block  tridiagonal  form. 

The  first  level  of  the  multilevel  algorithm  (block  cyclic  reduction)  is  based  on  the 
reordering  of  A  given  in  (2.2). 


Ai 

Bi  1 

A3 

C3  B3 

As 

Cs  Bs 

An 

Cn 

C2  B2 

A2 

C4  B4 

A4 

Ce  ... 

Ae 

...  Bn-\ 

1 

1 

The  sets  (Cr,  Ar,  B,,  5r_i,  Cr+i }  for  r=2,  4,  ..,  N-1  are  called  separators  since  they 
separate  the  matrix  A  into  independent  blocks  Ar,r  =  1,3, ..,  N.  The  reordering  of  A  into 
A  therefore  involves  a  symmetric  row  column  reordering  where  the  separators  are  moved 
to  the  last  rows  and  columns.  If  A  is  diagonally  dominant,  so  is  A  since  the  symmetric 
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reordering  preserves  diagonal  dominance.  Obviously,  any  symmetry  of  A  will  also  be 
preserved. 

N  is  assumed  to  be  odd,  and  this  is  merely  a  convenient  assumption.  If  N  is  originally 
even,  either  two  block  rows  and  columns  can  be  merged  to  reduce  N  by  one  or  the 
reordering  of  (2.2)  can  be  modified  slightly  to  account  for  an  even  value  of  N. 

In  practical  applications,  the  dimensions  of  separators,  n,  for  r=2,4,..,N-l,  are  usually 
small  compared  with  the  dimensions  of  the  remaining  blocks,  n,  for  r=l,3,...,N,  but  this 
is  not  a  necessary  condition  for  the  application  of  the  multilevel  algorithm  although  it 
may  be  a  necessary  condition  for  obtaining  high  efficiency. 

3  Multilevel  LU— factorization 

The  first  level  of  the  multilevel  LU-factorization  is  a  standard  LU-factorization  of  A 
defined  in  (2.2)  stopping  after  rii  A  nz . . .  +  pivot  rows.  This  leaves  the  lower  right- 
hand  block  of  dimension  n2  +  n4  . . .  +  unfactored.  The  partially  factored  matrix  is 
called  A  and  is  defined  in  (3.1) 


A  = 


Bx  1 

LzUz 

Oz  Bz 

LsUs 

Cz  Bs 

CjV 

Cz  Bz 

i42  D2 

C4 

£4  A4  D4 

Ce  ... 

Es  Ae  . . . 

. . .  B\j^\ 

. . .  A,v-i 

(3.1) 


The  submatrices  in  (3.1)  are  related  to  the  ones  in  (2.2)  as  follows 


LMr  =  Ar,r  —  l,3,..,iV 

are  lower  and  upper  triangular  matrices,  respectively. 

(3.2a) 

Br  =  L~^  Br,  r  =  1, 3, N  —  2. 

(3.2b) 

Cr  =  Cr,  r  =  3, 5, ..,  N. 

(3.2c) 

=C,+xU;\  r  =  l,3,..,iV-2. 

(.3.2d) 

Br-X  =  Br-xf^~\  r  =  3,5,..,./V. 

(3.2e) 

A,  =  A,  -  C,B,.x  -  B,C,+x,  5  =  2, 4, ..,  TV  -  1. 

(3.2f) 

6 


D,  =  D,-  ^  =  2,4, TV  -  3.  (3.2g) 

E,  =  Es-CsC,.i,  5  =  4, 6,.., TV  -  1.  (3.2h) 

The  tableau  in  (3.1)  and  the  relations  (3.2)  reveal  the  block  parallelism  in  computing 
A.  The  factorization  of  a  diagonal  block  A,  (r  is  odd)  and  the  associated  row  operations 

leading  to  C,,  5,,  Cr+i,  ^r-i  and  Er+i  can  be  performed  independent  of  the  cor¬ 

responding  computations  for  different  r-values.  The  only  overlap  is  in  the  computation 
of  A,  as  specified  in  (3.2f).  At  the  end  of  this  section,  a  convenient  organization  for  the 
parallel  computation  of  A  is  presented. 

Notice  that  D,  =  0  (s  =  2, 4, N  —  3)  and  Es  =  0  (s  =  4, 6, —  1)  for  the  block 
tridiagonal  matrix  A  defined  in  (2.1).  However,  the  multilevel  algorithm  handles  at  no 
extra  cost  the  case  where  the  separators  are  extended  with  D,  0  and  E,  ^  0.  In  this 
case  the  even  numbered  rows  and  columns  of  A  (the  separators)  will  contain  five  matrix 
blocks:  A,,B,,Iljand£),_2,B,_i,  A,,C,+i,£^,+2  ,  respectively  for  s  =  4.  6,  ..,  N-3. 

For  s  =  2,  Dq  and  E^  will  be  absent  while  Ds-i  and  Es+\  will  be  absent  for  s  =  N-1. 

The  D,  and  matrices  have  the  following  dimensions: 

for  s  =  2, 4, ..,  N-3 
5  =  4,  6,  AT  -  1. 

In  the  tridiagonal  case  (,2.1),  the  matrices  D,  and  E,  are  fill-ins.  The  LU-factorization 
will  in  general  also  create  fill-ins  in  the  original  blocks  of  A  unless  they  are  full  from  the 
outset. 

The  four  blocks  of  A  seperated  by  the  dzished  lines  can  be  expressed  in  a  compact 
form  as: 


A  = 


LU  V 
W  At 


where 


(3.3a) 


A  = 


’  L  o' 

U  V  ' 

W  I 

0  At 

(3.3b) 


At  is  the  Schur  complement  and  L  and  U  are  lower  and  upper  block  triangular  matrices, 
respectively,  composed  of  Lr  and  f/,.,  r=l,  3,  ..,  N. 

The  multilevel  algorithm  (block  cyclic  reduction)  is  now  bcised  on  the  fact  that  A,  is  a 
block  tridiagonal  matrix  which  can  be  reordered  into  a  form  similar  to  A  in  (2.2),  partially 
LU-factored  like  A  leaving  a  block  tridiagonal  partially  factored  lower  right-hand  block 
etc. 

If  V  _  _  1.  the  process  will  terminate  after  d  levels  with  a  Schur  complement 

consisting  of  just  one  block  which  is  then  factored.  If  N  is  composed  differently,  some 
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of  the  intermediate  block  tridiagonal  matrices  will  have  an  even  block  dimension,  but  as 
mentioned  in  the  previous  section,  this  is  only  a  minor  complication. 

The  partial  factorization  of  A  leading  to  A  can  be  performed  conveniently  in  parallel 
by  partitioning  the  original  block  tridiagonal  matrix  A  as  follows: 


Qi  = 


Bx 


and  Qn  = 


As  Cs 
Bs-i  0 


(3.4a.  b) 


Qr  = 


Ar 

C'r+1 

Br-X 


Br 

Ar+X 

Dr-X 


Cr 

Br+X 

0 


for  r  =  3, 5, N  —  2 


(3.4c) 


Dr~x  =  0  and  Er+x  =  0  when  A  is  block  tridiagonal,  and  D^-x  ^  0  and  ^  0 

only  when  A  has  extended  separators  as  discussed  priviously.  The  Q-matrices  can  be 
considered  slightly  reordered  samples  of  A  which  are  straightforward  to  establish. 

The  Q-matrices  can  be  partially  factored  in  parallel  leading  to  the  Q-matrices  defined 
in  (3.5). 


Qx  = 

■  LxUx 

c. 

Bx  ■ 
Aj  . 

and  O.v  =  [ 

Bs-i 

Cs 

Ajv-i 

(3.5a.  b) 

■  LrUr 

Br 

Cr  • 

Qr  = 

Ar+1 

Br+X 

for  r  =  3,5, 

(3.5c) 

.  Br-X 

Dr-X 

Ar-1  . 

The  Q-matrices  can  be  computed  conveniently  by  applying  standard  LU-factorization 
to  the  Q-matrices  and  stopping  when  Ar,  for  r  being  odd,  is  completely  factored. 

The  entries  of  the  Q-matrices  are  as  defined  in  (3.2)  except  for  A,  and  .4,: 

A,  =  A,  —  C,  B,-x  and 
A,  =  —  S,  C,+i  leading  to 
A,  =  A,  +  A,  for  s  =  2, 4, ..,  N  —  1 

Pivoting  has  not  been  considered  so  far  since  the  possibilities  are  limited  in  the  multi¬ 
level  algorithm.  From  the  partial  LU-factorizations  leading  to  the  Q-matrices,  however.it 
is  obvious  that  pivoting  can  be  done  as  usual  during  the  factorization  of  Ar  for  r  being  odd 
as  long  as  the  search  for  a  pivot  element  is  limited  to  Ar.  If  Ar  is  singular,  the  algorithm 
requires  fundamental  modifications  to  work  properly. 

The  next  level  of  the  multilevel  algorithm  involves  the  partial  LU-factorization  of  .4t 
defined  in  (3.3).  The  block  representation  of  A,  as  given  in  (3.1)  is 
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Aj  D2 
E\  A4  D4 
At  —  Ee  ^ 

...  An-i 

The  Aj-matrix  is  sampled  similarly  to  A  to  create  Q2,  Qe,--,  Qn-i  defined  in  (3.6). 
From  (3.5)  it  is  seen  that  Q2,  Qe,  Qn-i  can  also  be  obtained  by  sampling  Qr  for  r  being 
odd.  The  relationships  are  hsted  in  (3.7)  where  Qj+iiOa+s)  — ^  Qs  specifies  that 

Qa  is  composed  of  samples  from  Q,^i,Qa+i  and  Qa+s  etc. 

=  t]  •5"- =  1  b"!  V] 

■  Aa  pa  Ea  ' 

Qa=  Ea+2  Aa+2  0  for  s  =  6, 1 0, .. ,  iV  -  5.  (3.6c) 

.  Da-2  0  0 

(Q»-i,  Qi+i,  O1+3)  (5,  for  s  =  2,6, 10,..,N -5.  (3.7a) 

{Qs-2',  Qn^  —*  Qn-\  (3.7b) 

The  partial  LU-factorization  of  At  can  now  be  performed  in  parallel  by  partially 
factoring  Q2,  Qn-i  which  are  resampled  to  create  Q4,  Qs,..,  Qn-3  etc.  until  the 

LU-factorization  is  completed  by  factoring  just  one  block. 

Each  partial  LU-factorization  of  a  Q-matrix  completes  the  factorization  of  the  first 
block  row  and  column  leaving  the  lower  right-hand  2x2  block  unfactored  and  the  subject 
of  the  next  level  of  factorization. 

So  far  it  has  been  implied  that  only  one  processor  is  to  be  used  for  the  partial  factor¬ 
ization  of  a  Q-matrix.  Since  each  level  of  the  multilevel  algorithm  deals  with  only  half 
the  number  of  Q-matrices  as  the  previous  level,  one  might  consider  using  more  than  one 
processor  for  each  Q-matrix  to  try  to  keep  all  processors  busy. 

In  the  block  parallel  organization  described  in  [10],  one  processor  is  assigned  to  each 
Q-matrix  in  the  top  level,  two  processors  in  the  next  level  etc.  up  to  4  processors  for  the 
2x2  Q-matrices  and  8  processors  for  the  3x3  Q-matrices.  The  natural  partitioning  of  the 
Q-matrices  into  blocks  is  used  to  allocate  one  or  several  blocks  to  each  processor.  Since 
the  LU-factorization  is  partial,  this  approach  results  in  good  load  balance  and  moderate 
communication  overhead. 

4  Multilevel  solution 

The  purpose  of  the  solution  step  is  to  compute  x  of  (1.1).  The  solution  is  expressed 
symbolically  as 
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X  =  6 


(4.1) 


The  reordering  of  A  into  A  is  expressed  by  the  permutation  matrix  Hi  such  that 
Hi  A  HJ  =  A.  This  leads  to  the  equation 


Ax  =  6  (4.2) 

where  x  =  Hi  x  (and  x  =  x)  and  b  =  Hi  b.  Equation  (4.2)  can  now  be  expressed  by 
the  partial  factorization  in  (3.3b), 

■  L  0 
W  I 

where  (xu,Xf)  and  (6u,6t)  axe  partitionings  of  x  and  6,  respectively,  corresponding  to  the 
block  factorization  of  A. 

Equation  (4.3)  is  solved  by  the  standard  approach, 

t/u  =  6„  ,  bt  =  bt-  Wy^  (4.4a,  b) 


■  u 

V  ■ 

Xu 

■  feu 

0 

At 

. 

.  . 

(4.3) 


xt  =  At”'  bt 


(4.4c) 


Xu  =  iVu  -  V Xt)  (4.4d) 

The  computation  of  Xj  in  (4.4c)  expresses  symbolically  the  solution  of  a  block  tridi¬ 
agonal  system  of  equations  similar  to  (1.1),  and  (4.4c)  is  therefore  analogous  to  (4.1). 
The  algorithm  outlined  by  the  equations  (4.1),  (4.2),  (4.3)  and  (4.4)  is  therefore  applied 
recursively  until  all  components  of  the  solution  are  eventually  computed. 

When  the  complete  multilevel  LU-factorization  is  available,  the  multilevel  solution  is 
obtained  by  a  recursive  application  of  the  relations  (4.4a,b,d)  until  (4.4c)  involves  just 
one  LU-factored  block.  In  order  to  describe  the  multilevel  solution  algorithm  the  vectors 
Xu,Xt  ,  6u  and  bt  are  defined  from  the  partitioning  in  (2.1)  and  the  permutation  Hi. 


Xi 

X2 

■  fel  ■ 

b-2 

X3 

,  Xt  = 

X4 

,  feu  — 

fes 

,  fe*  = 

fe4 

. 

XN-1 

6;V 

fe;V-l 

Furthermore  partitioned  vectors  y„  and  bt  similar  to  x^  and  bt,  respectively,  are  defined. 
The  detailed  block  relations  corresponding  to  (4.4  a,b,d)  are: 

j/r  =  br  for  r  =  1,3,  ..,N.  (4.5a) 

br+i  =  6r+l  -  Cr+l  J/r  “  Br+l  yr+2  for  r  =  1, 3,  ..,  N  -  2.  (4.5b) 
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(4.5d) 


^1  =  (j/i  -  B1X2)  ,  XN  =  Uj;}'-  (i/jv  -  C/vi;v-i)  , 

Ir  =  U~^  (j/r  -  Ct  Xr-\  -  Brir+l)  fOF  T  =  3,5,  N  -  2 

Each  of  the  three  sets  of  relations  (4.5a),  (4.5b)  and  (4.5d)  can  be  computed  in  parallel 
using  (N+l)/2  processors.  The  computation  of  j/r  and  requires  factorized  matrix  blocks 
from  Qt  defined  in  (3.5)  while  requires  blocks  from  both  Qj-i  and  Q,+i. 


5  Hypercube  implementation 

This  section  describes  the  general  principles  for  the  implementation  of  the  multilevel 
algorithm  on  a  hypercube  parallel  computer  of  dimension  d.  The  number  of  processors 
is  p  =  2“^  ,  and  it  will  be  assumed  that  all  processors  are  used  for  the  top  level  of  the 
multilevel  algorithm  which  means  that  p  is  related  to  the  block  dimension  N  through  the 
following  relation. 


p  =  2‘'  =  (iV+l)/2  (5.1) 

If  N  is  determined  by  the  problem  (which  is  not  the  Ccise  for  a  band  matrix)  and  violates 
relation  (5.1),  the  multilevel  algorithm  may  require  minor  modifications.  If  jV  >  2p  —  1, 
more  processors  may  be  simulated  by  running  several  processes  on  each  processor.  If 
yV  <  2p  —  1,  some  processors  may  be  left  idle  or  more  than  one  processor  may  be  used  for 
some  or  all  Q-matrices  (3.4). 

It  will  be  assumed  in  the  rest  of  this  section  that  (5.1)  is  satisfied.  The  Q-matrices  of 
the  top  level,  level  1,  are  allocated  to  the  processors  as  follows.  Matrix  Q2.+1  is  allocated 
to  processors  P,  for  i=0,l,..,p-l.  This  is  expressed  formally  as 

Q2.+1  — ►  P,  for  i  =  0, 1,  ..,p  -  1  (5--i) 


The  allocation  relation  for  level  2  is 

Qix+2  — ►  P2,  for  i  =  0,  l,..,(p/2)  -  1  (5.2b) 


The  general  allocc  ’  on  relation  for  level  ^  is 

<?2^+2'->  — ^  P2«-I),  for  i  =  0, 1,..,  (p/2^^"*^)  -  1  (5.3) 

The  allocation  principle  is  illustrated  in  Fig.  5.1  for  p=S  and  consequently  N=15.  The 
processors  are  labeled  0,1,. .,7  expressed  as  binary  numbers.  Processors  in  a  hypercube 
can  be  labeled  such  that  processors  whose  labels  differ  in  only  one  bit  position  are 
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Processor 

000 

001 

010 

on 

100 

101 

no 

111 

Level 

1 

Qi 

Qz 

Qs 

Qr 

Q9 

Qn 

Q\3 

Qi5 

y 

y 

y 

/ 

2 

Q2 

Qe 

Qio 

Qi4 

y 

y 

3 

Q4 

Qi2 

y 

4  Qs 

Figure  5.1;  Processor  allocation  and  communication  structure  for  a  hypercube  implementation 

of  the  multilevel  algorithm. 


neighbors.  The  allocation  relation  in  (5.3)  is  constructed  to  permit  multilevel  factorization 
and  solution  using  only  neighbor  to  neighbor  communication. 

The  multilevel  LU-factorization  proceeds  as  follows.  At  level  1,  Q3, Qf^  are 

partially  LU-factored  in  parallel.  Then  the  unfactored  parts  of  Q3,  Qt,--,  are  passed 
on  to  the  processors  storing  Qi,  Qs,..,  Qn-2  (see  Fig.  5.1).  This  involves  only  neigh¬ 
bor  to  neighbor  communication,  and  it  can  be  done  completely  in  parallel.  However, 
according  to  (3.7a)  the  construction  of  Q2,  Qe, Qn-s  requires  contributions  from  three 
(j-matrices  from  the  previous  level,  and  thus  more  communication  and  a  more  elaborate 
communication  scheme  to  limit  communication  distance  are  required. 

In  order  to  circumvent  this  problem  and  use  the  simple  allocation  and  communication 
scheme  examplified  in  Fig.  5.1,  the  Q-matrices  of  levels  other  than  the  first  one  are 
redefined  slightly. 


Aj  D2 

£4  A4 


An-\  En-\ 

Ds-3  An-Z 


A, 

D,.2 


ps 

A»+2 

0 


£. 

0 

A,_2 


for  s  =  6, 10, ..,  N  —  5 


(5.4a.  b) 


(5.4c) 


The  definition  in  (5.4)  will  supersede  (3.6)  in  the  following,  and  the  prime  symbol  will 
be  left  out  from  the  Q-matrices  defined  by  (5.4).  Likewise,  the  construction  rule  (3.7)  is 
superseded  by 


{Q,-i,Qs+i)  Q,  for  s  =  2,6, ..,N  -  1 


(5.5) 


The  construction  of  the  general  matrix  Q,  in  (5.4c)  is  easily  verified  by  writing  Qa-i 
and  Q,+i  in  detail,  bzised  on  (3.5c): 
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■  Ls-xUs-, 

Bs-X 

Cs-X  ■ 

Ls-x-xUs+x 

Bs+X 

C'j+i 

Qa-l  — 

Cs 

As 

Es 

»  Qs+X  — 

Es+2 

As+2 

Es+2 

.  Bs.2 

Ds-2 

As-2  . 

.  Bs 

Ds 

As  . 

Qg  is  composed  from  the  lower  right-hand  2x2  blocks  and  A,  is  computed  as  A,  = 

Ag  +  Aj. 

The  observation  leading  to  the  redefinition  of  the  Q-matrices  for  levels  greater  than 
one  is  that  the  partial  factorization  of  a  Q-matrix  leaves  the  first  block  row  and  column 
completely  factored  and  the  rest  unfactored.  This  means  that  the  first  block  row  and 
column  (and  especially  A,,  s=2,6,..,N-l)  must  hold  the  final  values  before  the  factorization 
while  intermediate  values  (like  A^  and  As,  s=4,8,..,N-3)  suffice  for  the  remaining  entries 
of  the  Q-matrix. 

After  the  partial  LU-factorization,  the  Q-matrices  of  level  2  will  have  the  same  struc¬ 
ture  as  the  Q-matrices  of  level  1  defined  in  (3.5).  The  allocation  algorithm  ensures  that 
the  Q-matrices  on  each  level  are  allocated  to  processors  that  are  pairwize  neighbors. 
This  means  that  the  partial  LU-factorization  of  the  Q-matrices  of  level  i  and  neighbor  to 
neighbor  communication  to  compose  the  Q-matrices  of  level  £-f  1  can  be  continued  until 
the  last  level,  d-l-1.  At  the  last  level  Q(Af+i)/2>  which  is  only  one  block,  is  computed  by 
adding  the  lower  right-hand  blocks  of  Q(/v+i)/4  and  Q3(;v+i)/4  ,  and  finally  Q(n+i)I2  is 
fully  LU-factored. 

The  blocks  of  the  b-vector  defined  in  (2.1)  are  allocated  with  the  corresponding  A- 
blocks  (diagonal  blocks)  which  means  that 


{b2i+xMi+2)  — ^  Pi  for  i  =  0, 1, ..,  p  -  2 

(5.6a) 

1 

1 

(5.6b) 

With  this  allocation,  the  first  level  of  the  solution  algorithm  defined  in 
formed  as  follows 

(4.5a,b)  is  per- 

Po  '•  Vx  —  ,  62  =  ^>2  —  ^22/1 

(5.7a,  b) 

Pi  •  Vr  —  Ly,  by  ,  by+x  ~  by+x  ^r+lj/r 

(5.7c,  d) 

^r— 1  —  By—xUr 

(5.7e) 

for  r  =  2i  -I-  1  and  i  =  1, 2, p  —  2 

Pp-x  ■  Vn  =  Ls^bs  ,  i/v-i  =  —Bn-xVn 

(5.7f,g) 
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The  partial  6-vectors  6,  and  6,  are  associated  with  the  unfactored  diagonal  blocks 
A,  and  Af  This  implies  that  they  are  communicated  along  the  same  routes  and  the 
ccilculation  6,  =  6,  -b  6,  takes  place  in  the  same  processor  and  at  the  same  level  as  the 
calculation  As  =  A,  A^. 

The  first  communication  step  then  involves 

Send  (b^i+a ,  641+4)  to  Pai  for  i  =  0, 1, (p/2)  -  2. 

Send  Sn-i  to  Pp_2 

At  level  2  the  computation  of  66,..,  b^-i  are  completed  according  to  (4.5b), 

641+2  =  ^41+2  +  641+2  for  i  =  0, 1, ..,  (p/2)  -  1  (5.8) 

The  remaining  6-vectors,  64,  63,..,  bfi-z  atre  not  computed  at  ie\el  2  since  they  are 
not  needed  at  this  level.  Besides,  their  computation  would  require  communication  among 
non-neighbors. 

The  computation  algorithm  and  communication  scheme  as  outlined  can  be  continued 
down  to  the  bottom  level  and  up  again  according  to  (4.5d).  Instead  of  giving  a  formal  al¬ 
gorithm  which  will  come  out  rather  complicated,  the  informal  description  of  the  multilevel 
solution  algorithm  will  be  supplemented  with  an  example  for  N=7  and  p=4  processors. 
The  example  shown  in  Fig.  5.2  also  includes  the  Q-matrices  and  a  specification  of  the 
level  where  they  were  computed. 

The  communication  pattern  of  the  solution  phase  can  be  deduced  from  the  data  de¬ 
pendencies  among  the  different  levels  of  computations  shown  in  Fig.  5.2. 
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Figure  5.2:  Parallel  solution  algorithm  for  N=:7  based  on  multilevel  LU-factorization. 

6  Parallel  band  matrix  solver  for  hypercube 

6.1  Partitioning  and  allocation 

The  principles  of  a  multilevel  parallel  solver  described  in  the  previous  sections  were  used 
for  an  implementation  on  the  Intel  iPSC  hypercube.  The  implementation  is  restricted  to 
linear  systems  with  structurally  symmetric  band  matrices.  This  restriction  is  exploited 
in  the  implementation  to  improve  the  basic  implementation  principles  described  in  the 
previous  section.  The  improvement  involves  the  equivalent  of  2-way  Gaussian  elimina¬ 
tion  for  the  partial  factorization  of  Qi  and  Qtf  to  permit  an  efficient  load  distribution. 
Programs  and  documentation  are  presented  in  [11]. 
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The  band  matrix  is  charaterized  by  dimension  n  and  by  bandwidth  w=l'|-2m  where 
m  is  the  number  of  upper  or  lower  ofF-diagonal  elements. 

The  block  tridiagonal  structure  of  (2.1)  is  imposed  on  the  band  matrix  by  choosing  N 
and  Til, n2—,nN  properly.  There  exist  the  following  constraints: 

N 

^  Hr  =  n  and  n,  >  m  for  r  =  1,2, ..,  N 

r=l 

N  is  chosen  to  match  the  actual  hypercube  (or  sub-cube)  of  dimension  d  which  means 
that  relation  (5.1)  is  fulfilled. 

The  dimension  of  the  separators  is  chosen  to  the  minimum  dimension, 

n,  =  m  ,  s  =  2,4,..,  -  1 

in  order  to  minimize  the  amount  of  work  involved  in  the  lower  levels  of  the  algorithm. 

The  dimensions  of  the  odd  numbered  blocks  are  in  general  determined  by  the  load 
distribution  algorithm  which  will  be  described  later.  However,  a  special  case  should  be 
mentioned  here,  namely  the  minimum  dimension  problem  where  nr  =  m  ,  r  =  1,2,..,  A/^. 
This  is  the  case  where  the  blocks  of  the  block  tridiagonal  matrix  in  (2.1)  are  chosen  as 
small  as  possible.  This  could  be  the  case  if  the  algorithm  is  excuted  by  a  fine  grain  parallel 
computer  where  the  maximum  number  of  blocks  for  a  particular  band  matrix  are  required 
in  order  to  exploit  as  many  processors  as  possible. 

Besides,  the  lower  right-hand  block  tridiagonal  matrix  of  (3.1)  (called  At  in  (3.3))  is  a 
minimum  dimension  matrix  with  all  blocks  of  dimension  m  when  A  originates  from  a  block 
tridiagonal  matrix  with  seperators  of  dimension  m.  The  minimum  dimension  property 
also  applies  to  the  lower  levels. 

The  structure  of  Qr  for  r=3,  5,  ..,  N-2  is  given  in  Fig.  6.1.  The  densely  dotted  areas 
correspond  to  the  non-zero  entries  of  Qr  with  Dr-i  =  0,  £’r+i  =  0  and  the  lightly  dotted 
areas  correspond  to  fill-ins  created  by  the  partial  factorization.  Q\  and  are  depicted 
similarly  in  Fig.  6.2  and  6.3. 

The  partial  factorization  of  Qr  leading  to  Qr  for  r=3,5,..,N  results  in  rather  severe 
amounts  of  fill-ins,  not  just  in  terms  of  zero  blocks  being  filled,  like  Ar-i  ,  Dr-i  and  Er+i 
but  also  fill-ins  inside  Br-i  and  Cr  ■  Qi  is  p£irtially  factored  completely  without  fill-ins 
while  Qf^  is  similar  to  the  general  Qr-matrix  in  this  respect. 

However,  it  is  possible  to  make  the  partial  LU-factorization  of  level  1  symmetric  by 
modifying  the  factorization  of  Qiv-  The  ordering  of  the  band  matrix  Aj^  of  Q,\  is  reversed 
by  a  synunetric  row  column  reordering.  The  column  reordering  also  includes  H.v-i  and 
the  row  reordering  includes  C^-  After  this  reordering,  the  partial  factorization  of  Qy 
leads  to  a  matrix  Qy  with  a  structure  identical  to  Qi  shown  in  Fig.  6.2. 

The  reordering  of  Qy  results  in  a  considerable  saving  in  the  operations  count  of  the 
partial  LU-factorization.  The  large  differences  in  operations  count  in  partially  factoring 
Qt  and  Qy  on  one  side  and  Qa,  Qs, Qa^-2  on  the  other  side  are  exploited  in  the  load 
distribution  algorithm. 
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Figure  6.1.  The  structure  of  ^  for  r^3,5,...,N-2 


Figure  6.2.  The  structure  of 


Figure  6.3.  The  structure  of 


The  Q-matrices  of  the  lower  levels  are  all  block  matrices  with  full  mxm  blocks.  The 
extreme  matrices,  corresponding  to  Q\  and  Qn,  are  2x2  block  matrices  while  the  interior 
matrices  have  block  dimension  3.  The  matrix  at  the  lowest  level  Q(;v+i)/2  is  just  one  block 
of  dimension  m. 

6.2  Complexity  analysis 

The  performance  of  the  multilevel  algorithm  can  be  estimated  on  the  basis  of  complexity 
analysis.  Let  Ft  denote  the  number  of  floating  point  operations  required  fur  the  partial 
factorization  of  Qr-  We  then  have  the  following  operations  counts. 

Level  1 


Fr  (n,)  =  n,m (2m  +  1)  ,  r  =  l,A^. 


(6.1a) 
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(6.1b) 


Fr  (n,)  =  2nrm  (4m  +  1)  ,  r  =  3, 5, A’  —  2. 
Level  i{2<t<d) 


r.  r.  14  3  3  2  1 

=  F;v+1-2<-i  =  -  2"*  “6^ 

38  3  5  2  1 

^  3  2  6 

for  t  =  l,2,..,(p/2^-')  -2 


(6.1c) 

(6.1d) 


Level  d+1 


F2d  =  -m^  —  -m^  —  -m 
3  2  6 

The  operations  count  of  a  standard  band  LU-factorization,  called  Fblu  is 


(6.1e) 


4  3  1 

Fblu  =  nm  (2m  +  1)  -  -m^  -  -m^  -  -m  (6.2) 

3  2b 

A  rough  estimate  of  the  total  complexity  of  the  multilevel  algorithm  is  F\fm  « 
2  n  m  (4m  +  1)  for  n  m  and  ^  1.  This  implies  that  Fmlu ! Fblu  4  which  is 
the  penalty  for  being  able  to  do  the  LU-factorization  in  parallel.  According  to  this,  the 
maximum  speed-up  from  the  multilevel  algorithm  using  p  processors  is  expected  to  be 
p/4.  A  more  accurate  speed-up  calculation  including  the  effect  of  the  load  balancing  will 
be  given  in  the  next  section. 

The  parallel  complexity  of  the  multilevel  LU-factorization  called  Fpixj  can  be  com¬ 
puted  as  the  sum  of  the  dominating  complexities  at  each  level, 


Fpiu  =  F2d_i  (n2<<_i)  +  (d  —  2)  f2d_2  + 

=  2n2</_im  (4m  -|-  1)  -b  (d  —  2)  f  ^m^  —  ^m^  —  \m)  -f  ■^m^  —  2m}  —  ^m  (6.3) 

\  3  2  6  /  3  3 

F2'i-i  and  F2d_2  represent  complexities  at  top  and  intermediate  levels,  respectively.  The 
expression  in  (6.3)  is  correct  under  certain  assumptions  about  the  partitioning,  e.g.  n\  = 
ns  =  ..n;v  or  the  partitioning  defined  by  the  load  balance  relations  (6.10).  Fpiu  is  valid  for 
d  >2,  and  using  p  =  2"^  processors,  Fpiu  gives  the  execution  time  for  an  LU-factorization 
when  multiplied  by  the  average  floating  point  excution  time  called  Tp. 

The  communication  time  model  is  based  on  the  model  [9] 

<  =  To  +  MIB  (6.4) 

where  the  latency  Tq  =  1.2m5ec  and  bandwidth  B=1  MByte/sec  are  measured  values  for 
neighbor  to  neighbor  communication  on  the  Intel  iPSC.  The  model  in  (6.4)  is  valid  for 
messages  of  length  M  bytes  fulfilling  0<  M  <  1024  bytes. 

The  multilevel  LU-factorization  algorithm  involves  the  communication  of  2mx2m 
matrices  from  one  level  to  a  lower,  except  to  the  last  level  where  only  an  mxm  matrix  is 


18 


transferred.  The  2mx2m  matrices  of  double  precision  numbers  fit  into  iKByte  for  m<5. 
Under  this  assumption,  the  communication  model  corresponding  to  (6.3)  is 


Tclu  =  {d-l)  (To  +  32mVB)  +  To  +  B  (6.5) 

Table  6.1  shows  Tew  I  (Tp  Fpw)  for  the  minimum  dimension  problem  {Fpw  is  com¬ 
puted  from  (6.3)  with  n2<<_i  =  m).  Tp  =  50/isec  is  an  approximate  value  of  the  gross 
floating  point  execution  time  measured  for  the  multilevel  algorithm.  Communication  is 
O(m^)  while  computation  is  O(m^). 


d\m 

1 

2 

3 

4 

5 

6 

2 

3.75 

0.48 

0.153 

0.070 

0.041 

0.034 

3 

3.68 

0.39 

0.125 

0.058 

0.033 

0.030 

4 

3.63 

0.36 

0.115 

0.054 

0.031 

0.029 

5 

3.61 

0.34 

0.109 

0.051 

0.030 

0.028 

Table  6.1  Communication  to  computation  ratio,  Tcw/{TpFpw)  , 
for  the  multilevel  LU-factorization  algorithm  applied  to 
the  minimum  dimension  problem. 

This  is  clearly  reflected  by  Table  6.1  which  shows  that  communication  dominates  for  m=l 
but  loses  significance  very  quickly  for  increasing  values  of  m.  The  model  (6.4)  and  therefore 
also  (6.5)  is  only  valid  for  m<5.  The  model  was  extended  also  to  include  m=6  which  is 
shown  in  Table  6.1.  This  demonstrates  that  it  is  sufficient  to  take  the  communication 
cost  into  account  for  m<5. 

The  communication  to  computation  ratios  of  Table  6.1  can  be  considered  worst-ccise 
values  for  the  multilevel  algorithm.  The  operations  count  Fpw  in  (6.3)  is  rri^) 

while  Tew  in  (6.5)  is  O(m^).  This  means  that  communication  becomes  negligible  for 
Tir  m,  r=l,  3,..,  N,  even  for  m=l. 

The  speed-up  of  the  parallel  multilevel  LU-factorization  algorithm  is  defined  as  the 
execution  time  of  a  standard  band  LU-factorization  on  a  single  processor  divided  by  the 
execution  time  of  the  parallel  algorithm  executed  on  p  processors. 

The  speed-up  is  computed  as  Sw  =  TpFpw/  {TpFpw  +  Tew)  where  the  complex¬ 
ities  are  given  in  (6.2),  (6.3)  and  (6.5).  Table  7.1  shows  speed-up  values  computed  for 
the  minimum  dimension  problem  (n2<<_i  =  m)  for  selected  values  of  m  and  d.  Theoretical 
values  of  Sw  are  given  in  parantheses  for  comparison  with  experimentally  determined 
values. 

The  multilevel  solution  algorithm  consists  of  a  forward  sweep  (levels  1  to  d+1)  and 
a  backward  sweep  (levels  d-fl  to  1),  cf.  Fig.  5.2.  Analogously  to  the  LU-factorization, 
only  the  computations  of  level  1  depend  on  the  total  dimension  n  of  the  problem  while 
the  lower  levels  only  depend  on  m. 

The  complexity  of  the  solution  algorithm  is  stated  separately  for  the  forward  and  the 
backward  sweep.  Let  Fr  denote  the  complexity  of  the  computational  step  during  the 
forward  sweep  involving  the  partially  factored  matrix  (Jr,  e.g.  the  complexity  of 
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Vr  —  ^r+l  —  ^r+l  ^r+l  Vr  »  ^r-1  —  ^r-l  Vr 

Similarly  Fr  denotes  the  complexity  of  a  computational  step  involving  Qr  during  the 
backward  sweep.  We  then  have  the  following  complexities: 

Level  1 

Fi  (ni)  =  2nim  ,  Fjv  (njv)  =  2n^m  —  m  (6.6a,  b) 

Fr  (n,)  =  4nrm  —  m  ,  r  =  3,5, N  —  2.  (6.6c) 

Fi  (ni)  =  ni  (2m  +  1)  ,  Fv  (2m  +  1)  (6.6d,e) 

Fr  (n,)  =  Ur  (4m  +  1)  ,  r  =  3, 5, N  —  2.  (6.6f) 

Level  ^  (2  <  £  <  d) 

Fcz-i)  =  Ff^+t-2<<-t)  =  F2(t-i)  =  Fv+i-2<'-')  =  3m^  (6.6g) 

-  m,  F2/j+2<<-‘)  =  for  i  =  1,2,...,  -  2  (6.6h,i) 

Level  d+1 

F(n+i)/2  =  F(n+i)/2  =  (6-6j) 

The  parallel  complexity  of  the  solution  algorithm  Fps  can  now  be  computed  as  the 
sum  of  the  dominating  complexities  at  each  level.  The  expression  (6.7)  is  valid  under  the 
same  assumptions  as  (6.3). 

Fps  =  (d  —  2)  ^lOm^  —  m^  +  n2<<_i  (8m  +  1)  +  8m*  —  m  (6.7) 

The  communication  model  for  the  multilevel  solution  algorithm  is 

Tcs  (^)  =  2  [(d  -  1)  (To  +  I6m/B)  +  To  +  8m/B]  (6.8) 

The  parameters  Tq  ajid  B  are  given  in  connection  with  (6.5),  and  (6-8)  is  valid  for  m<64 


d\m 

1 

2 

5 

10 

15 

20 

2 

5.36 

1.43 

0.24 

0.062 

0.028 

0.016 

3 

5.36 

1.38 

0.23 

0.058 

0.027 

0.015 

4 

5.36 

1.35 

0.22 

0.056 

0.026 

0.015 

5 

5.36 

1.34 

0.22 

0.055 

0.025 

0.015 

Table  6.2.  Communication  to  computation  ratio,  Tcs/  (TpFps),  for  the 
multilevel  solution  algorithm  applied  to  minimum  dimension  problems. 
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Table  6.2  shows  how  communication  affects  the  efficiency  of  the  multilevel  solution 
algorithm  applied  to  minimum  dimension  problems,  i.e.  the  computational  complexity 
Fps  is  computed  from  (6.7)  for  nja.t  =  m.  Comparing  with  Table  6.1,  it  is  seen  that  the 
solution  algorithm  is  communication  bounded  to  a  larger  extent  than  the  LU-factorization 
algorithm. 

Communication  cost  only  depends  on  m  and  d  (6.8)  while  the  computational  com¬ 
plexity  is  0(n2d_im).  This  means  that  Table  6.3  shows  a  worst-case  situation,  and  that 
communication  cost  even  for  m=l  becomes  negligible  for  n2d_i/m  ^  1. 

The  operations  count  of  a  standard  forward-backward  solution  based  on  an  LU- 
factored  band  matrix  is 


Fbs  =  4nm  -bn  —  2m^  —  2m  (6.9) 

The  speed-up  of  the  multilevel  solution  algorithm  can  now  be  computed  for  the  mini¬ 
mum  dimension  problem  as  5$  =  TpFpsI  {TpFps  -t-  Tcs)  where  the  complexities  are  given 
in  (6.7)  (for  n2<<_i  =  m),  (6.8)  and  (6.9).  Table  7.3  in  section  7  below  shows  speed-up 
values  computed  from  Ss  for  selected  values  of  m  and  d.  The  speed-up  values  estimated 
by  5s  are  given  in  parantheses  for  comparison  with  experimentally  determined  values. 

6.3  Load  balancing 

From  (6.1  a,b)  it  is  clear  that  a  uniform  partitioning  of  the  band  matrix,  ui  =  rij  =  ..  = 
njv,  will  lead  to  poor  load  balance  for  the  LU-factorization.  A  first  attempt  to  improve 
the  load  balance  would  be  to  choose  na  =  ns  =  ..  =  n;v-2  and  nj  =  n.v  =  4n3i|"|)'^.  This 
choice  results  in  Fi  =  F3  =  ..  =  Fn  and  thus  load  balance  in  level  1.  However,  some 
imbalance  still  remains  in  the  lower  levels  as  specified  by  (6.1  c,d). 

The  lower  levels  could  be  load  balanced  after  the  same  principle  followed  in  level  1 
by  chosing  some  of  n2,  n^, ..,  n;v-i  greater  than  m,  but  this  is  undesirable  since  it  would 
increase  the  computational  load  of  the  lower  levels  where  parallelism  is  harder  to  exploit. 

The  load  balancing  scheme  chosen  for  the  band  matrix  can  be  explained  by  referring 
to  the  complexity  relations  (6.1)  and  to  Fig.  5.1.  The  factorization  of  Qi  and  Q3  finish 
at  the  same  time  when  nj  =  4n3^^^  permitting  processor  Fo  to  start  the  factorization 
of  Q2  without  idle  time.  Likewise,  the  factorization  of  Qs  and  Q7  complete  at  the  same 
time  when  =  n-r.  The  ratio  n^/rij  is  now  adjusted  such  that  the  factorization  of  Q2 
and  Qs  finish  at  the  same  time.  This  would  complete  the  load  balancing  of  Fig.  5.1  since 
the  matrix  partitioning  is  symmetric  ni  =  nAr,n3  =  n;v-2i  •  •  ■ 

The  balancing  principle  is  easily  generalized  and  (6.10)  gives  a  general  set  of  balance 
equations  based  on  (6.1).  For  a  given  set  of  values  n,  m,  and  d.  the  load  balance  equations 
determine  ni,  03, ..,  niv.  The  equations  are  ecisily  solved  by  expressing  ns.  nr. ..  by  nj  and 
substituting  these  expressions  into  the  last  equation  and  solving  it  for  ni. 

n,.  -  r  -  1,3,..,(A'' -  l)/2  (symmetry)  (6.i0a) 

ns  =  ut  (6.10b) 
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^9  —  nil  —  ^13  —  ni5 


(6.10c) 


—  n2(<i— 1)^3  —  ..  —  n2d_i  (6.106) 

Fi  (ni)  =  ^3  (ns)  (6.10e) 

(^i)  +  F2  =  Ft  {ut)  +  Fq  (6.10f) 

(”i)  +  2F2  =  Ci5  (nis)  +  2Fi4  (6.10g) 

F\  (ni)  +  (6  -  2)  F2  =  F2d_i  (n2<i_i)  +  (6  -  2)  F2d_2  (6.10h) 

2  ^ni  +  n3  +  2n7  +  4ni5..  -f  2‘*”^n2.<_i)  +  (2“^  —  l)  m  =  n  (6.10i) 


The  solution  of  (6.10)  will  in  general  not  lead  to  integer  values  of  ni,n3,..,  but  the 
resulting  values  are  rounded  to  satisfy  the  last  equation  which  states  that  the  total  number 
of  equations  is  n.  When  n  is  too  small,  an  effective  load  balance  is  not  obtained.  The 
balance  equations  (6.10)  yield  n,-values  smaller  than  m  which  is  not  permitted  by  the 
present  implementation.  This  situation  is  dealt  with  by  increasing  the  n,.-values  smaller 
than  m  to  become  m,  and  by  reducing  the  rir-values  greater  than  m  correspondingly. 

Communication  cost  has  not  been  included  into  the  load  balance  equations  since  com¬ 
munication  cost  is  such  a  small  fraction  that  it  can  only  be  taken  properly  into  consider¬ 
ation  for  m<2.  This  can  be  seen  from  the  following  argument. 

Equation  (6.10e)  is  modified  to  include  communication  cost: 

Fi  (ni)  =  F3  (n3)  -f  (Tq  -t-  32m^ I /TV 

Since  F^in^)  =  2njm{im  +  1),  communication  is  only  going  to  affect  after  rounding 
if 


2  (Fo  -I-  32m Vf)  /Tf  >  2m  (4m  +  1 ) 

The  break-even  value  for  m  is  m=2.52  which  means  that  communication  cost  is  too 
small  in  a  relative  sense  to  be  included  in  the  load  balance  if  m  >  2.  This  is  in  good 
agreement  with  Table  6.1. 

It  is  obvious  from  (6.1)  that  the  execution  time  for  the  multilevel  LU-factorization  is 
proportional  to  n  when  ni  =  ns  =  . . .  =  tin.  When  the  partitioning  is  based  on  the  load 
balance  equations  (6.10),  the  complexity  Fpnj  defined  in  (6.3)  and  thus  the  execution 
time  is  still  proportional  to  n  since  n2<<_i  in  (6.3)  depends  linearly  on  n  through  the  load 
balance  equations  (6.10).  The  derivative  obtained  by  solving  (6.10)  is: 

5n2-_i/5n  =  [4  (4m -b  1)  /  (2m -f  1) -f- 2  +  4 -f  8  +  . . . -b  2'^“']  *  (6.11) 

The  multilevel  solution  algorithm  can  be  load  balanced  using  the  equations  (6.10) 
where  Fr  functions  from  (6.6)  are  substituted  for  F,..  This  entails  some  approximations 
besides  ignoring  communication  cost.  First,  (6.10)  has  rij  =  while  Fj  (rj)  F\  {rii): 
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and  secondly,  Fr  ^  Fr  which  means  that  balance  is  only  obtained  for  the  forward  sweep. 
The  discrepancies,  however,  are  either  0(m)  or  0(n,.)  and  do  not  lead  to  serious  imbalance. 

A  crude  approximation  to  the  solution  algorithm  load  balance  can  be  based  on  the 
equation  Fi  (nj)  =  F3  (na)  ,  n-i  =  ns  and  ns  =  ns  =  ..  =  ns-2-  This  leads  to  ni  % 
2n3  which  should  be  compared  with  the  corresponding  relation  nj  %  4n3  for  the  LU- 
factorization.  This  implies  that  optimum  load  balance  requires  different  partitionings  of 
the  band  matrix  for  LU-factorization  and  solution.  Since  the  solution  must  be  bcised 
on  the  result  of  a  factorization  and  the  load  distribution  chosen  for  the  factorization, 
load  distribution  should  in  practice  be  chosen  to  minimize  total  execution  time  of  LU- 
factorization  and  solution(s). 

The  minimization  of  the  execution  time  of  a  multilevel  LU-factorization  followed  by 
a  sequence  of  solution  steps  (e.g.  for  pseudo-Newton  iteration)  is  a  complicated  prob¬ 
lem  and  will  not  be  addressed.  If  only  one  solution  per  LU-factorization  is  needed,  the 
forward  sweep  of  the  solution  algorithm  can  be  merged  with  LU-factorization  leading  to 
a  multilevel  Gaussian  elimination,  and  this  part  can  easily  be  load  balanced  by  solving 
(6.10)  with  Fr  -I-  Fr  substituted  for  Fr-  The  backward  sweep  will  not  be  well  balanced, 
but  execution  time  can  only  be  reduced  by  a  different  load  distribution  for  a  very  large 
number  of  processors  and  very  small  value  of  m. 

7  Performance  of  the  parallel  band  matrix  solver 

7.1  Execution  time  model 

Figure  7.1  shows  an  execution  time  model  for  sequential  band  LU-factorization  (marked 
Tblu)-  Execution  time  for  the  sequential  algorithm  is  proportional  to  Fblv  defined  in 
(6.2). 

The  execution  time  graph  for  the  parallel  multilevel  algorithm  has  more  complicated 
features.  For  a  given  number  of  processors,  p=(N-|-l)/2,  the  smallest  problem  that  can 
be  solved  has  dimension  n=m  N.  Therefore  the  e.xecution  time  graph  starts  at  n  =  m  X. 
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T^blu 


Figure  7.1.:  Executing  time  T  as  a  function  of  problem  dimension 
n  for  a  given  set  of  m  and  N(=2‘^'^  -1). 


The  load  balance  equations  (6.10)  lead  to  the  following  relations: 


Til  >  n3>  m  >  ”15  >  . . .  >  n2<j_i 

This  means  that  the  smallest  value  of  n  for  which  load  balance  is  effective  (n=n£,)  is  defined 
by  n2</_i=m.  This  value  is  substituted  into  (6.10h)  from  which  ni  can  be  computed. 
The  remaining  n^-values  can  now  be  computed  from  equations  (6.10e,f,g,..)  and  is 
computed  from  (6.10i)  ais 

Til,  ~  2  -b  Tij  +  2n7  +  4ni5  +  ..  +  2'^  +  ^2^^  —  1^  m 

Execution  time  is  constant  {Tpiu  =  for  m  N  <  n  <  n^,. 

For  n  >  Til,  the  execution  time  increases  linearly  with  n,  according  to  (6.3)  and  (6.11). 
The  speed-up  of  the  parallel  multilevel  algorithm  over  the  sequential  algorithm  for  a  given 
value  of  n  is  derived  from  Fig.  7.1  as  Sw  =  Tblu/Tplu.  The  speed-up  is  a  nonlinear 
function  of  n. 

Two  speed-up  values  are  of  particular  interest,  namely  the  speed-up  of  the  minimum 
dimension  problem, 

Siu*  =  TBLulTpLu\n=mN  (f-l) 

and  the  maximum  asymptotic  speed-up  computed  for  n  >  ni  : 

=  (dTBLufdTi)  /  {dTpiu/dti)  (7.2) 
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The  speed-up  of  the  minimum  dimension  problem  is  a  worst-caise  value.  It  is 
characterized  by  only  two  parameters,  m  and  N,  and  it  does  not  involve  load  balancing. 
The  maximum  asymptotic  speed-up  cannot  be  attained  since 

5^  =  lim  Slu 

and  it  corresponds  to  the  speed-up  obtained  by  neglecting  the  computation  involved  in 
the  lower  levels  of  the  multilevel  algorithm.  For  a  given  value  of  n,  the  speed-up,  Sm,  is 
bounded  as  follows: 

Smin  ^  c  ^  coo 

Lu  S  JLU  <  om 

The  discussion  so  far  has  only  been  concerned  with  LU-factorization.  The  solution 
algorithms  have  the  same  qualitative  properties  and  S'g'’'  and  5|^  are  defined  analogously 
to  and  in  (7.1)  and  (7.2). 

7.2  Numerical  results 

Figure  7.2  shows  an  example  of  the  problems  which  were  solved  by  the  parallel  multilevel 
LU-factorization  in  order  to  verify  the  properties  of  the  algorithm  experimentally.  The 
graphs  of  Fig.  7.2  are  of  the  types  presented  in  Fig.  7.1.  The  dots  indicate  measured 
values.  All  results  in  this  section  are  based  on  measurements  reported  in  [11]. 

An  interesting  feature  of  Fig.  7.2  is  that  the  graphs  intersect.  This  hais  as  a  con¬ 
sequence  that  certain  problems  are  solved  more  efficiently  by  fewer  processors  than  the 
maximum  number.  The  phenomenon  originates  from  the  load  balance  algorithm  and  from 
the  fact  that  a  doubling  of  the  number  of  processors  increases  the  depth  of  the  multilevel 
algorithm  by  one. 
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m\p 

2 

4 

8 

16 

32 

Hi 

- 

(1.06) 

0.7 

(0.78) 

0.8 

(1.03) 

1.14 

(1.54) 

2.13 

(2.44) 

1.4 

(1.56) 

0.95 

(0:99) 

1.10 

(1.19) 

1.60 

(1.72) 

2.57 

(2.67) 

19 

1.5 

(1.69) 

0.96 

(0.99) 

1.14 

(1.17) 

1.64 

(1.67) 

2.57 

(2.59) 

1.6 

(1.71) 

0.96 

(0.95) 

1.14 

(1.10) 

1.65 

(1.57) 

2.56 

(2.43) 

191 

1.6 

(1.72) 

0.97 

(0.95) 

1.15 

(1.10) 

1.65 

(1.57) 

2.53 

(2.43) 

Table  7.1.  Measured  and  predicted  (in  paranthesis)  values  of  speed-up  of  the  multilevel 
parallel  LU-factorization  over  a  sequential  band  factorization.  The  speed-up, 
applies  to  the  minimum  dimension  problem,  N=2p-1  and  n=m  N. 


Table  7.1  shows  measured  and  predicted  (in  parantheses)  values  of  speed-up  for  the 
parallel  multilevel  LU-factorization  algorithm  applied  to  the  minimum  dimension  prob¬ 
lem.  All  execution  times  are  measured  with  a  resolution  of  5msec.  This  means  that 
execution  time  for  m=2  and  p=2  cannot  be  measured  (2-3msec)  and  that  execution  time 
for  m=5  and  p=2  is  not  very  accurate  (20-25msec).  The  measurements  only  include  LU- 
factorization  and  corresponding  communication  for  the  parallel  algorithm.  Downloading 
of  programs,  set-up  of  problem  etc.  are  excluded  from  the  execution  time  measurements. 

The  predicted  speed-up  values  are  computed  as  explained  in  Section  6.1.  The  expres¬ 
sion  for  Fpiu  given  in  (6.3)  does  not  include  p=2  (d:=l).  A  special  formula,  which  is 
easily  derived,  was  used  for  this  column  in  Table  7.1. 

There  is  good  agreement  between  measured  and  predicted  speed-up  values  except  for 
m=2  where  overhead  like  procedure  calls  and  initialization  leads  to  smaller  speed-up  than 
expected  from  the  model  which  only  includes  floating  point  operations. 

For  m=15  and  m=20  the  observed  speed-up  is  slightly  greater  than  the  predicted 
speed-up  for  p>4.  This  phenomenon  could  be  explained  by  the  fact  that  the  block 
structure  of  the  Q-matrices  in  the  multilevel  algorithm  leads  to  the  equivalent  of  unrolling 
of  the  loops  of  the  factorization  algorithm.  The  sequential  band  matrix  factorization  is 
programmed  in  a  straightforward  style. 

Table  7.2  shows  measured  and  predicted  (in  paranthesis)  values  of  the  maximum 
asymptotic  speed-up  for  the  parallel  multilevel  LU-factorization.  Since  these  speed-up 
values  correspond  to  neglecting  the  computational  expense  of  the  lower  levels,  communi¬ 
cation  is  also  neglected  in  the  model.  The  predicted  speed-up  values  are  computed  from 
(6.2),  (6.3),  (6.11)  and  (7.2).  There  is  very  close  agreement  between  measured  and  pre¬ 
dicted  values  since  the  top  level  of  the  multilevel  algorithm  is  programmed  very  similarly 
to  the  sequential  band  matrix  factorization.  This  means  that  the  measured  speed-up  is 
very  close  to  the  ratio  of  floating  point  operations  counts. 

The  maximum  asymptotic  speed-up  of  the  parallel  LU-factorization  using  p  processors 
and  load  balancing  has  the  following  limit  which  is  easily  derived  from  (6.1a,b)  and  (6.2): 

(p  +  6)/4  for  m oo  (7.3) 

Put  into  words,  the  performance  of  the  multilevel  LU-factorization  algorithm  with  p 
processor  and  load  balancing  according  to  (6.10)  is  identical  to  the  performance  with  p-f6 
processors  and  no  load  balancing  (ni  =  n3  =  ...=nAr). 
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m\p 

2 

4 

8 

16 

32 

2 

1.99 

(2) 

2.55 

(2.56) 

3.66 

(3.67) 

5.90 

(5.89) 

10.4 

(10.3) 

5 

1.99 

(2) 

2.51 

(2.52) 

3.56 

(3.57) 

5.65 

(5.67) 

9.84 

(9.86) 

10 

1.99 

(2) 

2.50 

(2.51) 

3.52 

(3.54) 

5.57 

(5.59) 

9.69 

(9.68) 

15 

1.99 

(2) 

2.50 

(2.51) 

3.51 

(3.52) 

5.50 

(5.56) 

9.66 

(9.62) 

20 

1.99 

(2) 

2.49 

(2.51) 

3.49 

(3.52) 

5.48 

(5.54) 

9.72 

(9.59) 

Table  7.2.  Measured  and  predicted  (in  paranthesis)  values  of  maximum  asymptotic 
speed-up,  5^1;.  of  the  multilevel  parallel  LU-factorization  over 
a  sequential  band  factorization. 


m\p 

2 

4 

8 

16 

32 

2 

- 

(0.51) 

1.3 

(0.73) 

1.8 

(1.06) 

2.0 

(1.66) 

2.7 

(2.69) 

5 

1.5 

(1.2) 

1.25 

(1.36) 

2.0 

(1.91) 

3.21 

(2.94) 

5.11 

(4.73) 

10 

1.5 

(1.5) 

1.67 

(1.56) 

2.41 

(2.16) 

3.88 

(3.30) 

6.17 

(5.31) 

15 

1.45 

(1.58) 

1.77 

(1.60) 

2.58 

(2.21) 

4.07 

(3.37) 

6.60 

(5.41) 

20 

1.59 

(1.61) 

1.78 

(1.61) 

2.65 

(2.23) 

4.11 

(3.39) 

6.67 

(5.44) 

Table  7.3.  Measured  and  predicted  (in  paranthesis)  values  of  speed-up  of  the  multilevel 
parallel  solution  algorithm  over  a  sequential  forward-backward  band  substitution 
algorithm.  The  speed-up  S5 applies  to  the  minimum 
dimension  problem,  N=2p-1  and  n=mN. 

In  Table  7.3  the  speed-up  values  of  the  parallel  solution  algorithm  are  given  for  the 
minimum  dimension  problem.  The  solution  algorithm  examplified  by  Fig.  5.2  is  applied 
to  a  multilevel  LU-factorization  produced  by  the  parallel  multilevel  LU-factorization 
algorithm. 

The  meaisured  speed-up  values  in  Table  7.3  for  m=2  and  m=5  with  p<8  are  rather 
inaccurate  because  of  the  resolution  of  5  msec  in  the  execution  time  metisurements. 

Disregarding  the  inaccurate  speed-up  meeisurements,  the  observed  speed-up  is  consis¬ 
tently  greater  than  the  predicted  speed-up  for  p>2.  This  somewhat  surprising  result  was 
traced  to  an  inadvertent  exploitation  of  processor  parallelism  at  the  lower  levels  of  the 
parallel  solution  algorithm.  Floating  point  computation  on  the  802S7  processor  and  index 
computation  on  the  80286  processor  were  to  a  certain  degree  overlapped  in  the  parallel 
algorithm  and  not  in  the  sequential  solution  algorithm. 

There  was  no  attempt  to  optimize  either  the  sequential  or  the  parallel  implementation, 
and  very  careful  optimization  could  probably  improve  the  speed  by  25%-50%.  However 
communication  would  still  be  insignificant  for  problems  large  enough  to  justify  the  use 
of  a  parallel  computer.  The  experimental  implementation  therefore  fulfils  its  purpose  of 
demonstrating  the  fejisibility  of  the  multilevel  algorithm. 

The  predicted  asymptotic  speed-up  involves  the  computation  of  dTpcv/dn  which 
is  computed  as 


dTpi^uldn  =  (5F2d_i/5n2d_i)  (5n2d_i/5n)  TV 
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Because  of  the  nature  of  the  load  balance  algorithm,  we  have  the  following  result  for 
n>  ni  : 

dTpwldn  =  {dFrldtir)  (drir/dn)  7>  (7.4) 

for  r=  1,  3,  N. 

can  now  be  expressed  as 

Sru  =  {dTBLu/dn)/[idFJdn^)  {dn,/dn)TF]  =  {dnjdn)-^ 

The  predicted  asymptotic  speed-up  for  the  multilevel  solution  algorithm  with  load 
distribution  for  optimum  L  U-factorization  speed  can  be  expressed  as 

ST  =  (dTssIdn)  /  [(a  (A  +  A)  /dn,)  {dm/dn)  Tp]  =  {dn,/dn)-^ 
which  is  equal  to  S^u- 

Since  load  balance  is  not  for  optimum  solution  algorithm  speed,  a  relation  similar  to 
(7.4)  is  .lot  vahd.  The  partitioning  rii  is  approximately  twice  as  large  as  the  optimum 
value.  Therefore  we  have  dFps/dn  =  d  (^Fi  +  Fij  /dn. 

A  table  analogous  to  Table  7.2  with  measured  and  predicted  values  of  ST  for  load  dis¬ 
tribution  for  optimum  LU-factorization  was  constructed.  As  expected,  it  was  essentially 
identical  to  Table  7.2,  and  it  was  therefore  omitted. 

Load  distribution  is  chosen  to  be  optimum  for  LU-factorization  since  this  is  close  to 
minimum  execution  time  for  one  LU-factorization  followed  by  one  solution  step. 

When  one  LU-factorization  is  computed  followed  by  a  large  number  of  solution  steps 
(e.g.  Newton  iteration)  it  may  be  advantageous  to  load  balance  for  optimum  solution 
speed.  In  this  case  we  have; 

ST  (P  +  2)  /2  for  m  — >  oo 

This  speed-up  is  almost  twice  cis  large  as  the  corresponding  limit  value  when  load  distri¬ 
bution  is  with  respect  to  LU-factorization  as  stated  in  (7.3). 

7.3  Row  interleaved  factorization  and  solution 

An  alternative  parallel  LU-factorization  and  solution  approach  is  the  row  interleaved 
algorithm  [8].  Contrary  to  the  multilevel  algorithm,  the  row  interleaved  algorithm  does 
not  pay  any  computational  penalty  for  the  parallelization,  only  communication  penalty. 
Each  pivot  row  must  be  broadcast  to  all  processors  to  permit  parallel  factorization.  The 
execution  time  model  for  the  row  interleaved  LU-factorization  is  cis  follows; 

Trlu  =  n  [m  (2m  -|-  1)Tf/2'^  -(-  </(7’o  -f  8m/.0)] 

The  row  interleaved  LU-factorization  can  be  compared  with  the  multilevel  LU-factorization 
by  comparing  asymptotic  speed-ups.  For  small  values  of  m  we  have  STu  >  STlu  where 
Srlu  defined  analogously  to  STu  • 
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=  {^BLu/dn) !  {dTfiLu/dn) 


Table  7.4  gives  the  integer  m-values  for  which  the  algorithms  break  even,  S'^^ 


S 


oo 

RLU' 


p 

4 

8 

16 

32 

m 

13 

15 

21 

31 

Table  7.4.  Integer  values  of  m  for  which  row  interleaved  and  multilevel 
LU-factorizations  break  even  in  asymptotic  speed-up. 

The  entries  of  Table  7.5  are  computed  from  the  equation 

dTfiiuldn  =  dTpLul^n 

where 

dTpLuldn  «  4m  (2m  +  1)  Tp/  (2'^  +  b) 

This  value  is  a  good  approximation  assuming  load  distribution  and  "large”  m. 

For  m-values  larger  than  those  given  in  Table  7.4,  the  row  interleaved  LU-factorization 
will  be  superior  to  the  multilevel  algorithm.  For  p=2,  the  multilevel  algorithm  is  always 
superior. 

The  values  of  table  7.4  were  compared  with  measurements  of  an  implementation  of 
the  row  interleaved  algorithm  and  measurements  of  the  multilevel  algorithm.  The  mea¬ 
surements  resulted  in  m=14  and  m=20  for  p=8  and  p=16,  respectively.  This  is  in  good 
agreement  with  the  predictions  of  the  model. 

The  row  interleaved  solution  algorithm  performs  very  poorly  since  the  number  og 
broadcasts  is  the  same  as  for  the  LU-factorization  while  computation  is  0(m)  for  each 
broadcast  for  the  solution  algorithm  compared  with  0(m*)  for  the  LU-factorization. 

Concluding  the  comparison  of  multilevel  and  row  interleaved  algorithms,  the  former 
is  superiour  for  narrow  band  problems  while  the  latter  takes  over  for  wide  band  prob¬ 
lems.  The  LU-factorizations  break  even  for  the  m-values  given  in  Table  7.5.  For  one 
LU-factorization  and  one  solution  step,  the  m-values  corresponding  to  break  even  will 
increase. 

It  is  obvious  that  the  m-values  of  Table  7.4  are  sensitive  to  the  communication  and 
computation  performance  of  the  parallel  computer  as  modeled  by  To,  B  and  Tp.  However, 
there  is  no  trend  in  parallel  computer  technology  towards  a  substantial  shift  of  the  break¬ 
even  values  of  m. 

Finally,  the  multilevel  solution  method  and  the  implementation  techniques  on  the 
hypercube  discribed  in  this  paper  should  also  be  applicable  to  the  other  members  of  the 
family  of  permutations  for  parallel  solution  of  block  tridiagonal  matrices  proposed  in  [7]. 
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